For this investigation, game data from the AFL, the professional league for Australian Rules Football, will be interrogated. The AFL is over 100 years old. Australian Rules is a team game with 22 players a side, in which each team aims to outscore the opposition: a goal is worth six points, and a kick that misses the goal but scores a ‘behind’ is worth one point (yes, a sport where you get points for missing!). For more information, I encourage you to check the usual suspects, Wikipedia and YouTube.

The data for this analysis has been scraped from the AFL website using a Python script contained in this same repo. Appropriate queries return JSON documents, which have been lightly modified and formatted into CSV files of tidy data using Pandas. The data covers all matches from 2001 to 2015, although not all statistics are recorded for earlier seasons.

There are three data sets that have been collected from the website:

- Match Data (md), a list of every match played, with key results about the match.
- Player Match Stats (pms), which contains a record of each player’s performance in each match, with vital statistics including kicks, handballs, tackles, marks, hitouts, goals and behinds.
- Player Summary Data (psum), key biographical data for each player, including height, weight, date of birth, draft year and age. This dataset is incomplete for some variables.

We’re also going to construct one extra dataframe, which is an aggregate of certain player stats, combined with the mean of their game statistics across the period. A subset of this dataset, including only players with more than 20 games, has also been built.
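
The aggregation described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the code used for the analysis; the field names (`player_id`, `kicks`, `handballs`) and the toy records are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_players(rows, stats=("kicks", "handballs")):
    """Collapse per-match records into one record per player with mean stats."""
    by_player = defaultdict(list)
    for row in rows:
        by_player[row["player_id"]].append(row)

    summary = {}
    for player_id, games in by_player.items():
        record = {"games": len(games)}  # number of recorded matches
        for stat in stats:
            record[stat] = mean(game[stat] for game in games)
        summary[player_id] = record
    return summary


def regular_players(summary, min_games=20):
    """Keep only players with more than `min_games` recorded games."""
    return {p: r for p, r in summary.items() if r["games"] > min_games}


# Toy per-match records (hypothetical values, not taken from the dataset).
rows = [
    {"player_id": "A", "kicks": 10, "handballs": 4},
    {"player_id": "A", "kicks": 14, "handballs": 6},
    {"player_id": "B", "kicks": 3, "handballs": 9},
]
summary = aggregate_players(rows)
subset = regular_players(summary, min_games=1)
```

With the toy records, player A has two games and is kept by `regular_players(summary, min_games=1)`, while player B has one game and is dropped.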

In general, the purpose of this investigation is to look at the relationship between players’ ‘properties’ (weight, height and playing position) and their playing statistics. General trends in the game will also be explored as they arise.

Data Quality

Looking at whether the data is well formed, we can investigate the number of home games by team. We observe low counts for two teams, GCFC and GWS, who only joined the competition 2-3 years ago. There also appear to be some games with missing labels. On closer inspection, these 49 games are missing all details, and occur inconsistently across the period 2001 to 2011.

Now let’s check the consistency of the data between the different sets. The number of games played correlates fairly well with the number of records in the database, except for players with a lot of games played (who debuted before 2001) and players who have 0 recorded games.

When we look at the dates of birth, draft and debut, the data is clearly incomplete. The date of birth would be expected to be fairly consistent by year, but clearly many older players don’t have DoB recorded. Additionally, draft and debut information only appears to be available for younger players.

EDA

Univariate Plots Section

First, let’s investigate the nature of some key variables, starting with the number of recorded games per player. Results appear to be well formed, with no obvious outliers. Note that 25% of players play 14 games or fewer in this period, and the median number of games is only 44.5; very few players actually manage to build a career in the game. The maximum number of games recorded is 330, which fits well with our teams playing around 170 home games over the period (along with around 170 away games).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   14.00   44.50   69.86  111.20  330.00

Look at the distribution of height and weight for the players. The height distribution looks almost normal, although there are twice as many players who are 180cm compared to 179cm, so there is clearly some rounding happening there. There is also an outlying spike at 200cm, but curiously not at 190cm. The weight distribution is similar, but with slightly longer tails, and shows the same rounding anomalies at 80kg and 100kg. There are no big outliers here.

Now visualise the distributions of different game statistics, for instance kicks and handballs. Both appear to be similarly distributed, with a shape resembling a log-normal or Weibull distribution. Most players appear to have more kicks (median=9) than handballs (median=6).

Most players score 0 goals per game, which is unsurprising given an average side will score around 15 goals in a game, shared amongst 22 players. Transforming the y-axis by log10 reveals the long tail of the distribution.

Hitouts are a unique part of the game: when the ball is bounced or thrown up by the umpire, a key-position player named the ruck will attempt to tap the ball to a player on their side, and this tap is known as a hitout. Because only a handful of tall players will play in the ruck, there are a large number of players with no hitouts. Beyond that the tail is very long, but decreases somewhat rapidly above 20 hitouts.

The long tail for hitouts is evident when you look at the quantiles associated with the parameter:

quantile(pms$hitouts, c(0.80,0.9,0.95,0.99, 0.999))
##   80%   90%   95%   99% 99.9% 
##     0     3    12    29    45

Finally we can look at how the aggregate statistics compare to the individual game statistics. As expected, averaging over many games pulls players towards the mean, in line with the law of large numbers. There is a median of 9 kicks per player per game across the entire dataset, but this drops to a median of 8.05 kicks per game when each player’s career is taken in aggregate.

Univariate Analysis

What is the structure of your dataset?

There are three data sets that have been collected from the website:

- Match Data (md), a list of every match played, with key results about the match.
- Player Match Stats (pms), which contains a record of each player’s performance in each match, with vital statistics including kicks, handballs, tackles, marks, hitouts, goals and behinds.
- Player Summary Data (psum), key biographical data for each player, including height, weight, date of birth, draft year and age. This dataset is incomplete for some variables.

What is/are the main feature(s) of interest in your dataset?

Statistics for players across different games.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Players’ simple physical characteristics (weight, height) and their number of games of experience, to see if these influence how often they perform certain roles.

Did you create any new variables from existing variables in the dataset?

A simple numerical indicator for each game was built as the composite of the year and round number, to allow for easier time series plotting without giant gaps in the off-season.
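
One way such an indicator could be built is sketched below in Python with the standard library. This is an assumption about the approach rather than the formula actually used: it ranks the distinct (year, round) pairs in chronological order, so consecutive rounds get consecutive integers and the off-season gap disappears. The function name `game_index` is hypothetical.

```python
def game_index(year_round_pairs):
    """Map each distinct (year, round) pair to a consecutive integer, in time order."""
    ordered = sorted(set(year_round_pairs))  # tuples sort by year, then round
    return {pair: i for i, pair in enumerate(ordered, start=1)}


# Toy example: the last round of 2001 is followed directly by round 1 of 2002.
pairs = [(2001, 22), (2002, 1), (2001, 1), (2001, 22)]
index = game_index(pairs)
```

Ranking avoids having to assume a fixed number of rounds per season, at the cost of the index depending on which rounds are present in the data.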

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A new aggregated dataset was constructed to reduce the volume of data, and to better gauge players’ average performances. Additionally, the number of positions was reduced from 24 to 9 to simplify interpretation. Otherwise the data was well formatted during the scraping and data wrangling in Python.

Bivariate Plots Section

To kickstart the investigation into relationships between variables, multiple scatterplot matrices were built between various combinations of variables.

When we start looking at highly correlated values, we see some pairs which we would probably expect to be well correlated, e.g. goals and behinds (0.93) and marks inside 50m with goals (0.93). The first is expected, given that players who take shots on goal more often will kick more behinds. Marks inside 50m are also highly correlated with goals, since taking a mark inside 50m gives you a clear shot on goal from where the mark was taken.

One interesting relationship to look at is between the free kicks a player receives and concedes; free kicks are typically awarded for ‘foul’ play. When we look at the relationship between free kicks for and against individual players we can see a moderate correlation of 0.47. We notice that although it balances out for most players, there are still a handful of ‘dirty’ players who concede 2-4 times as many free kicks as they receive.
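
The ratio of frees conceded to frees received can be made concrete with a short sketch. This is an illustrative Python example using the standard library only; the function names, the threshold of 2 and the toy totals are assumptions, not values from the dataset.

```python
def free_kick_ratio(frees_for, frees_against):
    """Frees conceded per free received; None when no frees were received."""
    if frees_for == 0:
        return None
    return frees_against / frees_for


def flag_heavy_conceders(players, threshold=2.0):
    """Return the names of players whose conceded/received ratio is at least `threshold`."""
    flagged = []
    for name, (frees_for, frees_against) in players.items():
        ratio = free_kick_ratio(frees_for, frees_against)
        if ratio is not None and ratio >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Toy career totals as (frees for, frees against); hypothetical values.
players = {
    "balanced": (40, 42),
    "heavy": (20, 60),
    "no_frees": (0, 5),
}
flagged = flag_heavy_conceders(players)
```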

There’s a moderately strong correlation seen between tackles and clearances (0.65), probably because players who play ‘on the ball’ are heavily involved in closer scuffles in the game, hence will rack up both stats.

Hitouts were moderately strongly correlated with height in the previous scatter matrices. However, if we filter on players who are frequently involved in hitouts, i.e. play in that role, by filtering for average hitouts greater than 5, we find that the correlation is less strong. Specifically, after applying the filter the correlation coefficient has dropped from 0.594 to 0.439. Clearly height is only one component of performing well in this role; skill and jump height probably also play an important role.

## 
##  Pearson's product-moment correlation
## 
## data:  hitout_df$heightInCm and hitout_df$hitouts
## t = 5.0795, df = 108, p-value = 1.594e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2744511 0.5787792
## sample estimates:
##       cor 
## 0.4391265
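
The test statistic and confidence interval in this output can be reproduced from the correlation coefficient and the sample size alone. The sketch below is in Python with the standard library and assumes the usual formulas: t = r * sqrt(df / (1 - r^2)) with df = n - 2, and a confidence interval from the Fisher z-transformation. The function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def pearson_t_statistic(r, n):
    """t statistic for testing that the true correlation is zero (df = n - 2)."""
    df = n - 2
    return r * sqrt(df / (1 - r * r))


def pearson_confidence_interval(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transformation."""
    z = atanh(r)                                  # Fisher transform of r
    se = 1 / sqrt(n - 3)                          # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    return tanh(z - crit * se), tanh(z + crit * se)


# Values read from the output above: cor = 0.4391265, df = 108, so n = 110.
r, n = 0.4391265, 110
t_stat = pearson_t_statistic(r, n)
low, high = pearson_confidence_interval(r, n)
```

Under these assumptions the computed values agree with the reported t of 5.0795 and the interval of roughly 0.274 to 0.579.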

Curiously, there is also a weak to moderate negative correlation between height and kicks (-0.41), although perhaps this is more related to the role that players have on the ground, rather than anything to do with their skill at kicking.

Part of this may be explained by Rucks, who are typically the tallest players, having fewer disposals (i.e. kicks+handballs) in general.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Two of the most interesting relationships are between player statistics and height.

Firstly, once entries are filtered for players with regular hitouts, there is only a moderate correlation between height and hitouts (0.44). Clearly technique and ability are as important as height for being able to reach the ball for a hitout.

Secondly, the negative correlation between height and kicks was curious. It’s not clear if this is due to taller players having different roles within the team or not.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between frees for and against individual players was quite interesting. There are clearly a handful of outliers who have a lot of free kicks called against them, who may be characterised as ‘dirty’ players.

What was the strongest relationship you found?

Goals and behinds, marks inside 50m and goals. These relationships are unsurprising, given a mark inside 50m leads to a shot on goal, and behinds happen when you miss the goal.

Multivariate Plots Section

First we’ll look at the physical characteristics of players in different positions. The scatterplots below include an extra contour at 50% density to give a better representation of where each position tends to be. What we see is that rucks form a definite cluster as the taller players. Half-forwards also have quite a spread, which tends to extend taller than other positions. Aside from these and the ‘other’ category, most players tend to be clustered around the same area, slightly higher than the general population average male height of 177cm.

Let’s take a look at the accuracy of players in kicking goals. In AFL, there are four posts to kick at: a kick between the two inner goal posts scores 6 points, while a kick between a goal post and an outer behind post scores only 1 point. The accuracy of a player kicking for goal is the number of goals divided by shots on goal (i.e. goals+behinds). The accuracy has been evaluated for all players with at least 50 shots on goal, and the results have been broken down by position, looking at both goals and accuracy for players in each position. Surprisingly, the median forward seems to be only marginally more accurate than non-forwards: median 60% vs 57.8%. However, the long, high tail of the box plot suggests that the best forwards are very accurate, at over 75%.

## Source: local data frame [8 x 2]
## 
##       position median(accuracy)
##         (fctr)            (dbl)
## 1       Centre        0.5679012
## 2    Half Back        0.5702479
## 3 Half Forward        0.5902778
## 4      Forward        0.6024784
## 5         Back        0.5842697
## 6        Other        0.5733333
## 7        Rover        0.5576923
## 8         Ruck        0.6055046
## [1] 0.5773196
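
The accuracy calculation behind this table can be sketched as follows. This is a minimal Python illustration using the standard library; the field names (`position`, `goals`, `behinds`) and the toy career totals are assumptions, and only the definition of accuracy and the 50-shot cutoff come from the text.

```python
from collections import defaultdict
from statistics import median


def accuracy(goals, behinds):
    """Share of shots on goal (goals + behinds) that were goals."""
    return goals / (goals + behinds)


def median_accuracy_by_position(players, min_shots=50):
    """Median accuracy per position, using only players with at least `min_shots` shots."""
    by_position = defaultdict(list)
    for player in players:
        shots = player["goals"] + player["behinds"]
        if shots >= min_shots:
            by_position[player["position"]].append(
                accuracy(player["goals"], player["behinds"])
            )
    return {pos: median(values) for pos, values in by_position.items()}


# Toy career totals (hypothetical values, not taken from the dataset).
players = [
    {"position": "Forward", "goals": 60, "behinds": 40},
    {"position": "Forward", "goals": 45, "behinds": 15},
    {"position": "Forward", "goals": 30, "behinds": 30},
    {"position": "Centre", "goals": 33, "behinds": 27},
    {"position": "Ruck", "goals": 10, "behinds": 5},  # only 15 shots: excluded
]
result = median_accuracy_by_position(players)
```

The minimum-shots filter matters because accuracy is very noisy for players with only a handful of shots; a player with one goal from one shot would otherwise count as 100% accurate.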

When we colour the scatter plot by position, we see that forwards and half-forwards dominate the high-scoring section of the plot, as expected. Additionally, we can note how the linear regressions for different positions fall into two clusters, one for forward positions and another for the rest.

To revisit the question of why taller players have fewer disposals, we can colour by position to see if the position plays a role in the relationship. What we find is that the rucks, who are the tallest players, have fewer kicks. However, all the subgroups also show weak to moderate negative correlations, so it’s not all attributable to position.

## Source: local data frame [8 x 2]
## 
##       position cor(heightInCm, kicks)
##         (fctr)                  (dbl)
## 1       Centre             -0.2708064
## 2    Half Back             -0.2827757
## 3 Half Forward             -0.4590485
## 4      Forward             -0.3183146
## 5         Back             -0.4711789
## 6        Other             -0.3373129
## 7        Rover             -0.2559500
## 8         Ruck             -0.5619785
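
A per-position correlation like the one tabulated above can be sketched by grouping the records and applying Pearson's formula within each group. The example below is in Python with the standard library; the function names and toy records are assumptions, while the column names `heightInCm` and `kicks` follow the output above.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_by_group(records, group_key, x_key, y_key):
    """Pearson correlation of x and y computed separately within each group."""
    groups = defaultdict(lambda: ([], []))
    for record in records:
        xs, ys = groups[record[group_key]]
        xs.append(record[x_key])
        ys.append(record[y_key])
    return {group: pearson(xs, ys) for group, (xs, ys) in groups.items()}


# Toy aggregated records (hypothetical values, not taken from the dataset).
records = [
    {"position": "Ruck", "heightInCm": 198, "kicks": 6},
    {"position": "Ruck", "heightInCm": 200, "kicks": 5},
    {"position": "Ruck", "heightInCm": 202, "kicks": 4},
    {"position": "Centre", "heightInCm": 180, "kicks": 14},
    {"position": "Centre", "heightInCm": 182, "kicks": 10},
    {"position": "Centre", "heightInCm": 184, "kicks": 12},
]
by_position = correlation_by_group(records, "position", "heightInCm", "kicks")
```

Each group needs at least two records with some variation in both variables, otherwise the denominator is zero.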

If we explore this a little further, by looking at the preference of kicks and handballs for players of different positions, we definitely see that rucks are unique in their preference for handballs over kicks.

Finally, let’s look at the change in the game over time. By plotting movement in certain statistics over time by team, and as an aggregate, we can observe certain patterns. The following plot shows the time series trends for kicks and handballs, with a black line for the league average and grey lines for each team. Additionally, three teams have been highlighted: Collingwood and Geelong, who were strong through the middle of the last decade, and Hawthorn, who are becoming increasingly dominant in the 2010s.

We see that the number of kicks per player per match has increased only slightly over time, although there have been periods where Collingwood (COLL) have played a high-kicking game (2006-2011) with comparatively few handballs. In general the number of handballs increased significantly over the course of the decade, something practised heavily by Geelong (GEEL) through 2007-2010, although the trend has receded and stabilised in recent years. More recently, Hawthorn appear to be playing a high-kicking game similar to the ‘Collingwood model’.

Throughout the decade there has been a lot of talk about ‘congested’ football, with many players close to the ball with great intensity. This can be seen in the tackle count rising steadily throughout the decade from 2001-2011. Sydney (SYD) have been known throughout this period for playing tight, congested football, which is seen with them leading the league in tackle count frequently throughout this period. Additionally, Sydney prominently led clearances (getting the ball out from a congested zone) throughout much of the decade too, especially in 2005 - 2013, where they were typically high above league averages.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The only position that showed a clear characteristic weight and height was the ruck; rucks are definitely the tallest players. However, many positions were of a similar height and weight, and there was no trend for certain positions being heavier or lighter relative to their height.

Splitting height vs kicks against position helped explain that part of the difference may be due to taller players being Rucks, and also kicking less. However, curiously, every single position had a negative correlation between height and kicks, for which I have no obvious explanation.

Were there any interesting or surprising interactions between features?

The most interesting relationship found was the only marginal improvement in kicking accuracy of players playing in forward positions, compared to players in other positions.

Final Plots and Summary

Plot One

Description One

The first plot illustrates the physical characteristics of players in different positions. The player position that most stands out is Ruck, who tend to be taller. Additionally, half-forwards and ‘other’ appear to have a moderate share of taller players, although there is no trend for taller players for forwards, who are curiously the shortest of players. There is no real difference in player weight between different positions; in general all players follow the same weight vs. height trend.

Plot Two

Description Two

The second plot shows the accuracy of players across various positions. Surprisingly, despite playing a role which requires frequently kicking for goal, Forwards and Half-Forwards are only slightly more accurate than other players, with a median of 60% vs 57.8%. However, the long, high tail of the box plot suggests that the best forwards and half-forwards are very accurate, at over 75%.

Plot Three

Description Three

The third plot illustrates two interesting qualities. The first is a change in the way the game is played over the last 15 years, with a significant increase in the number of handballs used by all teams over the period 2003-2008. The second is a massive outlier in the usage of the handball by Geelong, one of the strongest teams in this period, which possibly drove the change in the game. However, another strong team in this period, Collingwood, trailed the league in the use of the handball; they had a game more oriented around kicking.


Reflection

This analysis started to uncover some interesting characteristics in the broad dataset that has been collected from AFL statistics. In general it appears that the dataset that has been constructed is reasonably well formatted and does not have significant problems, although certain statistics have only been collected in more recent years.

The raw data collected for the project was quite broad; one dataframe was 126876 rows by 41 variables. Additionally, some parameters of interest (e.g. player characteristics) required merging data from other data sources (which existed in separate dataframes). To handle the numerous possibilities in analysis, many of the analyses have been based around aggregate statistics. Consequently, it can be argued that reducing the data in this manner may have lost a lot of useful information. However, these aggregations seemed necessary given the scope of the project.

Such a detailed and broad dataset invites extensive further analysis. Of particular interest is the career evolution of individual players over time. This analysis could ask whether there are any early indicators of future success in players’ statistics near the beginning of their careers, which would help list managers decide which players are more likely to come good over time. This type of analysis would lend itself to statistical testing and the construction of machine learning models.

Furthermore, no work was done in this exploratory analysis to relate team statistics and player statistics to team outcomes, i.e. wins and losses. Given the ultimate objective of football is to win games, it would be interesting to see what factors most contribute to winning. In a similar vein, it could answer questions like ‘do the best 5 players or worst 5 players on a team mean the most in the context of winning or losing?’. Given AFL teams are constructed within a salary cap, this information would be useful when deciding whether to spend a large portion of the cap on a few superstars or to seek a highly balanced list.